Quandl 2: Analysis

This notebook describes the following:

  1. Getting data from quandl,
  2. Applying some descriptive analytics,
  3. Applying time series analysis,
  4. Running a linear regression,
  5. VAR (code sketch),
  6. Panel data / fixed effects (code sketch),
  7. OLS with pandas,
  8. Logistic regression.

Note: this notebook aims to introduce the analysis workflow without concentrating on the details of rigorous economic research.


1. Getting data from quandl


In [1]:
import quandl

In [3]:
with open('quandl_key.txt','r') as f:
    key = f.read()

In [147]:
data = quandl.get(["FRED/GDP","FRED/UNRATE", "FRED/FEDFUNDS", "FRED/CPIAUCSL"],authtoken=key, collapse="annual")
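
If only a subperiod is needed, the same call can be restricted by date. This is an unexecuted sketch; it assumes the installed quandl package supports the start_date/end_date keywords (quandl >= 3 does).

In [ ]:
# sketch: fetch the same series for 1990 onwards only
subset = quandl.get(["FRED/GDP", "FRED/UNRATE", "FRED/FEDFUNDS", "FRED/CPIAUCSL"],
                    authtoken=key, collapse="annual",
                    start_date="1990-01-01", end_date="2016-12-31")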

In [148]:
data.head()


Out[148]:
FRED/GDP - Value FRED/UNRATE - Value FRED/FEDFUNDS - Value FRED/CPIAUCSL - Value
Date
1947-12-31 260.3 NaN NaN 23.41
1948-12-31 280.7 4.0 NaN 24.05
1949-12-31 271.0 6.6 NaN 23.61
1950-12-31 320.3 4.3 NaN 24.98
1951-12-31 356.6 3.1 NaN 26.47

In [7]:
data[[0]].head()


Out[7]:
FRED/GDP - Value
Date
1947-12-31 260.3
1948-12-31 280.7
1949-12-31 271.0
1950-12-31 320.3
1951-12-31 356.6
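
Positional selection such as data[[0]] works in the pandas version used here, but selecting by column name (or via .iloc) is more explicit and keeps working in newer pandas. A quick equivalent sketch:

In [ ]:
# equivalent, more explicit selection of the GDP column
data[["FRED/GDP - Value"]].head()
# or positionally
data.iloc[:, [0]].head()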

2. Applying some descriptive analytics


In [59]:
data.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 71 entries, 1947-12-31 to 2017-12-31
Data columns (total 4 columns):
FRED/GDP - Value         71 non-null float64
FRED/UNRATE - Value      70 non-null float64
FRED/FEDFUNDS - Value    64 non-null float64
FRED/CPIAUCSL - Value    71 non-null float64
dtypes: float64(4)
memory usage: 2.8 KB

In [61]:
data.describe()


Out[61]:
FRED/GDP - Value FRED/UNRATE - Value FRED/FEDFUNDS - Value FRED/CPIAUCSL - Value
count 71.000000 70.000000 64.000000 71.000000
mean 5874.000000 5.794286 4.817969 107.397915
std 5932.320223 1.620510 3.672046 75.778956
min 260.300000 2.700000 0.070000 23.410000
25% 735.750000 4.700000 2.287500 31.550000
50% 3407.800000 5.550000 4.470000 97.700000
75% 10199.200000 6.600000 6.612500 171.700000
max 19027.100000 10.800000 18.900000 243.846000

In [9]:
data.corr()


Out[9]:
FRED/GDP - Value FRED/UNRATE - Value FRED/FEDFUNDS - Value FRED/CPIAUCSL - Value
FRED/GDP - Value 1.000000 0.179185 -0.476201 0.981338
FRED/UNRATE - Value 0.179185 1.000000 0.055904 0.252364
FRED/FEDFUNDS - Value -0.476201 0.055904 1.000000 -0.377063
FRED/CPIAUCSL - Value 0.981338 0.252364 -0.377063 1.000000

In [10]:
data.cov()


Out[10]:
FRED/GDP - Value FRED/UNRATE - Value FRED/FEDFUNDS - Value FRED/CPIAUCSL - Value
FRED/GDP - Value 3.519242e+07 1723.718381 -10389.773552 441155.722619
FRED/UNRATE - Value 1.723718e+03 2.626054 0.322154 30.935200
FRED/FEDFUNDS - Value -1.038977e+04 0.322154 13.483921 -103.118090
FRED/CPIAUCSL - Value 4.411557e+05 30.935200 -103.118090 5742.450208

In [11]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [13]:
sns.heatmap(data.corr())


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0xe7e19e8>
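
A purely cosmetic variation (not run above): the heatmap can print the correlation coefficients in each cell via seaborn's annot/fmt arguments.

In [ ]:
# sketch: annotated correlation heatmap
sns.heatmap(data.corr(), annot=True, fmt=".2f", cmap="coolwarm")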

3. Applying time series analysis


In [70]:
from statsmodels.tsa.arima_model import ARIMA

In [77]:
from statsmodels.tsa.stattools import acf, pacf  # autocorrelation and partial autocorrelation functions
acf_vals = acf(data[[0]])    # keep separate names so the imported acf/pacf functions are not shadowed
pacf_vals = pacf(data[[0]])
plt.plot(acf_vals)        # ACF suggests q, the number of MA components
plt.plot(pacf_vals, 'r')  # PACF suggests p, the number of AR components


Out[77]:
[<matplotlib.lines.Line2D at 0x11a88b70>]

In [88]:
from statsmodels.tsa.stattools import adfuller

In [96]:
stationarity_test = adfuller(data["FRED/GDP - Value"])
print(stationarity_test[1])  # p-value of the ADF test; a large value means we cannot reject a unit root


0.712949320292
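
The p-value of about 0.71 means a unit root in the GDP level cannot be rejected, which motivates d=1 in the ARIMA model below. Re-running the test on the first difference (a sketch, not executed in the original run) checks whether one round of differencing is enough.

In [ ]:
# sketch: the same ADF test on the first-differenced GDP series
stationarity_test_diff = adfuller(data["FRED/GDP - Value"].diff().dropna())
print(stationarity_test_diff[1])  # a small p-value here would support d=1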

In [82]:
model = ARIMA(data[[0]], order=(1, 1, 0))  # order=(p, d, q): 1 AR term, first differencing, no MA term

In [83]:
results = model.fit()

In [84]:
plt.plot(data[[0]].diff())          # plot the first-differenced GDP series (what the I(1) model is fit on)
plt.plot(results.fittedvalues, 'r') # overlay the fitted values from the ARIMA(1, 1, 0) model


Out[84]:
[<matplotlib.lines.Line2D at 0x120e8908>]

In [35]:
from statsmodels.tsa.arima_model import ARIMAResults

In [85]:
results.summary()


Out[85]:
ARIMA Model Results
Dep. Variable:    D.FRED/GDP - Value    No. Observations:        70
Model:            ARIMA(1, 1, 0)        Log Likelihood       -452.316
Method:           css-mle               S.D. of innovations   153.929
Date:             Mon, 03 Jul 2017      AIC                   910.631
Time:             20:37:03              BIC                   917.377
Sample:           12-31-1948            HQIC                  913.311
                  - 12-31-2017

                              coef    std err        z    P>|z|    [95.0% Conf. Int.]
const                     253.0893     74.370    3.403    0.001     107.327   398.852
ar.L1.D.FRED/GDP - Value    0.7625      0.076   10.099    0.000       0.615     0.911

Roots
        Real    Imaginary    Modulus    Frequency
AR.1  1.3114     +0.0000j     1.3114       0.0000
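
Once fitted, the results object can also produce out-of-sample forecasts. A hedged sketch using the forecast method of the (old) statsmodels.tsa.arima_model results API used above:

In [ ]:
# sketch: out-of-sample forecast for the next 5 annual observations
forecast, stderr, conf_int = results.forecast(steps=5)
print(forecast)
print(conf_int)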

4. Running a linear regression


In [57]:
from statsmodels.formula.api import ols

In [54]:
# positional column references; see the Q() sketch after the summary below for readable coefficient names
model_ols = ols(formula="data[[0]] ~ data[[1]]+data[[2]]+data[[3]]", data=data)

In [55]:
results_ols = model_ols.fit()

In [56]:
results_ols.summary()


Out[56]:
OLS Regression Results
Dep. Variable: data[[0]] R-squared: 0.977
Model: OLS Adj. R-squared: 0.976
Method: Least Squares F-statistic: 865.8
Date: Mon, 03 Jul 2017 Prob (F-statistic): 2.52e-49
Time: 20:11:55 Log-Likelihood: -525.15
No. Observations: 64 AIC: 1058.
Df Residuals: 60 BIC: 1067.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -177.1130 498.665 -0.355 0.724 -1174.591 820.365
data[[1]] -204.7884 75.228 -2.722 0.008 -355.266 -54.310
data[[2]] -188.5093 34.191 -5.513 0.000 -256.902 -120.117
data[[3]] 75.4664 1.709 44.162 0.000 72.048 78.885
Omnibus: 0.983 Durbin-Watson: 0.136
Prob(Omnibus): 0.612 Jarque-Bera (JB): 1.060
Skew: -0.227 Prob(JB): 0.589
Kurtosis: 2.563 Cond. No. 606.
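
The formula above refers to columns positionally, which makes the coefficient labels hard to read. A sketch of the same regression using patsy's Q() to quote the column names directly in the formula (not run above):

In [ ]:
# sketch: same OLS regression with quoted column names
model_named = ols(formula='Q("FRED/GDP - Value") ~ Q("FRED/UNRATE - Value") '
                          '+ Q("FRED/FEDFUNDS - Value") + Q("FRED/CPIAUCSL - Value")',
                  data=data)
model_named.fit().summary()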

In [67]:
data[[0]].hist(bins=5)


Out[67]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010A3E7F0>]], dtype=object)

In [68]:
data[[0]].diff().hist()


Out[68]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010EA05C0>]], dtype=object)

In [69]:
import numpy as np
np.log(data[[0]]).hist()


Out[69]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000110100F0>]], dtype=object)

In [98]:
results_ols.resid.hist()


Out[98]:
<matplotlib.axes._subplots.AxesSubplot at 0x122dcda0>

In [99]:
sns.distplot(results_ols.resid)


Out[99]:
<matplotlib.axes._subplots.AxesSubplot at 0x1243ef98>
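
A Q-Q plot is another quick normality check on the residuals, complementing the histograms above. A sketch using statsmodels' qqplot:

In [ ]:
# sketch: Q-Q plot of the OLS residuals against a normal distribution
import statsmodels.api as sm
sm.qqplot(results_ols.resid, line='s')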

5. VAR


In [ ]:
from statsmodels.tsa.api import VAR
model_var = VAR(data.dropna())  # VAR needs a multivariate series; drop years with missing values
results_var = model_var.fit()
results_var.summary()
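
Building on the cell above (still a sketch), the lag order can be chosen by an information criterion and the impulse responses plotted:

In [ ]:
# sketch: select the lag order by AIC, then inspect impulse responses over 10 periods
results_var = model_var.fit(maxlags=4, ic='aic')
results_var.irf(10).plot()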

6. Panel data (fixed effects)


In [ ]:
from pandas.stats.plm import PanelOLS  # pandas' (since deprecated) panel-regression API
# Note: a fixed-effects regression needs genuine panel data (entity x time);
# this cell is only a sketch of the interface, not a meaningful model for this dataset.
reg = PanelOLS(y=data[[0]], x=data[[1]], time_effects=True)
reg
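
Since pandas.stats.plm was later removed from pandas, here is a hedged sketch of the same fixed-effects idea using the separate linearmodels package (an assumption, not used in the original notebook) on a made-up two-country panel:

In [ ]:
# sketch: fixed effects with linearmodels on a hypothetical panel (MultiIndex: entity, time)
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS

idx = pd.MultiIndex.from_product([["US", "UK"], range(2000, 2010)],
                                 names=["country", "year"])
panel = pd.DataFrame({"y": np.random.randn(20), "x": np.random.randn(20)}, index=idx)
PanelOLS(panel["y"], panel[["x"]], time_effects=True).fit()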

7. OLS with pandas


In [127]:
from pandas.stats.ols import OLS  # pandas' built-in (since deprecated) OLS wrapper
linear  = OLS(y=data["FRED/GDP - Value"],x=data["FRED/UNRATE - Value"])
linear


Out[127]:
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         70
Number of Degrees of Freedom:   2

R-squared:         0.0321
Adj R-squared:     0.0179

Rmse:           5882.9726

F-stat (1, 68):     2.2557, p-value:     0.1378

Degrees of Freedom: model 1, resid 68

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x   656.3911   437.0393       1.50     0.1378  -200.2059  1512.9881
     intercept  2150.8781  2628.1394       0.82     0.4160 -3000.2752  7302.0314
---------------------------------End of Summary---------------------------------
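
Because pandas' OLS wrapper was later removed, the equivalent regression in statsmodels looks like this (sketch, not run above):

In [ ]:
# sketch: the same univariate regression with statsmodels
import statsmodels.api as sm
X = sm.add_constant(data["FRED/UNRATE - Value"])
sm.OLS(data["FRED/GDP - Value"], X, missing='drop').fit().summary()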

8. Logistic regression


In [131]:
data[[0]].pct_change().mean()


Out[131]:
FRED/GDP - Value    0.063803
dtype: float64

In [132]:
import numpy as np

In [149]:
data["GDP_status"] = np.where(data[[0]].pct_change()>data[[0]].pct_change().mean(),1,0)

In [150]:
data.head()


Out[150]:
FRED/GDP - Value FRED/UNRATE - Value FRED/FEDFUNDS - Value FRED/CPIAUCSL - Value GDP_status
Date
1947-12-31 260.3 NaN NaN 23.41 0
1948-12-31 280.7 4.0 NaN 24.05 1
1949-12-31 271.0 6.6 NaN 23.61 0
1950-12-31 320.3 4.3 NaN 24.98 1
1951-12-31 356.6 3.1 NaN 26.47 1

In [161]:
data["FRED/UNRATE - Value"]= data["FRED/UNRATE - Value"].fillna(data["FRED/UNRATE - Value"].mean())

In [162]:
from statsmodels.api import Logit
# note: no constant term is included here; see the intercept sketch at the end of the notebook
logit = Logit(data['GDP_status'], data["FRED/CPIAUCSL - Value"])
results_logit = logit.fit()
results_logit.summary()


Optimization terminated successfully.
         Current function value: 0.646569
         Iterations 4
Out[162]:
Logit Regression Results
Dep. Variable: GDP_status No. Observations: 71
Model: Logit Df Residuals: 70
Method: MLE Df Model: 0
Date: Mon, 03 Jul 2017 Pseudo R-squ.: 0.06385
Time: 22:24:52 Log-Likelihood: -45.906
converged: True LL-Null: -49.037
LLR p-value: nan
coef std err z P>|z| [95.0% Conf. Int.]
FRED/CPIAUCSL - Value -0.0049 0.002 -2.438 0.015 -0.009 -0.001
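
The model above has no intercept (Df Model: 0, and the LLR p-value is nan). Adding a constant gives a more conventional specification; a sketch:

In [ ]:
# sketch: the same logit with an intercept added
import statsmodels.api as sm
X = sm.add_constant(data[["FRED/CPIAUCSL - Value"]])
sm.Logit(data["GDP_status"], X).fit().summary()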